Top-down Extraction of Semi-Structured Data
Abstract
In this paper, we propose an innovative approach to extracting semi-structured data from Web sources. The idea is to collect a couple of example objects from the user and to use this information to extract new objects from new pages or texts. We propose a top-down strategy that extracts complex objects by decomposing them into less complex objects until atomic objects have been extracted. Through experimentation, we demonstrate that, with a small number of given examples, our strategy is able to extract most of the objects present in a Web source given as input.
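The top-down decomposition described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the paper derives its extraction knowledge from user-supplied examples, whereas this sketch assumes hand-written regular expressions. The `Node` class, the `extract` function, the patterns, and the sample page are all hypothetical names and data introduced here for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of a hypothetical object schema (not from the paper)."""
    name: str
    pattern: str  # assumed regex; group 1 captures this object's text
    children: list = field(default_factory=list)  # empty list => atomic object


def extract(node, text):
    """Find each occurrence of `node` in `text`, then recurse into its parts."""
    results = []
    for match in re.finditer(node.pattern, text, flags=re.DOTALL):
        fragment = match.group(1)
        if not node.children:
            # Atomic object: the captured text is the value.
            results.append(fragment.strip())
        else:
            # Complex object: decompose the fragment into less complex objects.
            results.append(
                {child.name: extract(child, fragment) for child in node.children}
            )
    return results


# Hypothetical schema: a "book" is a complex object with two atomic parts.
book = Node("book", r"<li>(.*?)</li>", [
    Node("title", r"<b>(.*?)</b>"),
    Node("author", r"<i>(.*?)</i>"),
])

# Hypothetical input page.
page = (
    "<ul>"
    "<li><b>Dune</b> by <i>Frank Herbert</i></li>"
    "<li><b>Emma</b> by <i>Jane Austen</i></li>"
    "</ul>"
)

objects = extract(book, page)
print(objects)
```

Each complex object is located first and only its own fragment is searched for its components, so the recursion stops when a node has no children, that is, when an atomic object has been reached.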
Related resources
Bottom Up and Top Down - Twig Pattern Matching on Indexed Trees
This article describes how to implement efficient memory resident path indexes for semi-structured data. Two techniques are introduced, and they are shown to be significantly faster than previous methods when facing path queries using the descendant axis and wild-cards. The first is conceptually simple and combines inverted lists, selectivity estimation, hit expansion and brute force search. Th...
Evaluation of Top-k Queries over Structured and Semi-structured Data
Hyperset approach to semi-structured databases and the experimental implementation of the query language Delta
This thesis presents practical suggestions towards the implementation of the hyperset approach to semi-structured databases and the associated query language ∆ (Delta). This work can be characterised as part of a top-down approach to semi-structured databases, from theory to practice. Over the last decade the rise of the World-Wide Web has led to the suggestion for a shift from structured rela...
Learning dialogue structures from a corpus
This paper demonstrates some aspects of a plan processor which is a subcomponent of the dialogue module of Verbmobil. We describe how we transfer results from the research area of grammar extraction for the semi-automatic acquisition of plan operators for turn classes. We exploit statistical knowledge acquired while learning the grammar and incorporate top-down predictions to enhance the corr...
Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach
Structured community portals extract and integrate information from raw Web pages to present a unified view of entities and relationships in the community. In this paper we argue that to build such portals, a top-down, compositional, and incremental approach is a good way to proceed. Compared to current approaches that employ complex monolithic techniques, this approach is easier to develop, un...
Publication date: 1999